An Adaptive Opportunistic Routing Scheme for
Wireless Ad-hoc Networks


A.A.  Bhorkar,  M.  Naghshvar,  T.  Javidi  and  B.D.  Rao 
Department  of  Electrical  Engineering, 
University  of  California  San  Diego,  CA,  92093 
{abhorkar, naghshvar, tjavidi, brao}@ucsd.edu


Abstract: In this paper, an adaptive opportunistic routing
scheme for multi-hop wireless ad-hoc networks is proposed. The
proposed  scheme  utilizes  a  reinforcement  learning  framework  to 
achieve  the  optimal  performance  even  in  the  absence  of  reliable 
knowledge  about  channel  statistics  and  network  model.  This 
scheme  is  shown  to  be  optimal  with  respect  to  an  expected  average 
per  packet  cost  criterion. 

The proposed routing scheme jointly addresses the issues of
learning  and  routing  in  an  opportunistic  context,  where  the 
network  structure  is  characterized  by  the  transmission  success 
probabilities.  In  particular,  this  learning  framework  leads  to 
a  stochastic  routing  scheme  which  optimally  “explores”  and 
“exploits”  the  opportunities  in  the  network. 

I.  Introduction 

Opportunistic routing for multi-hop wireless ad-hoc networks
has attracted recent research interest as a way to overcome
deficiencies of conventional routing [1]-[6] as applied in wireless
settings. Opportunistic routing decisions are made in an on-line
manner, choosing the next relay based on the actual transmission
outcomes as well as a rank ordering of relays. This
on-line  and  sample-path  dependent  structure  of  opportunistic 
schemes  improves  the  performance  of  routing  by  exploiting 
the  broadcast  nature  of  wireless  transmissions  as  well  as  the 
inherent  path  and  multi-user  diversity  present  in  a  network. 

The  authors  in  [1],  [6]  provided  a  Markov  decision  theoretic 
formulation  for  opportunistic  routing.  In  particular,  it  is  shown 
that  the  optimal  routing  decision  at  any  epoch  is  to  select  the 
next  relay  node  based  on  an  index  summarizing  the  expected- 
cost-to-forward  from  that  node  to  the  destination.  This  index 
is  shown  to  be  computable  in  a  distributed  manner  and  with 
low  complexity  using  the  probabilistic  description  of  wireless 
links.  The  study  in  [1],  [6]  provides  a  unifying  framework  for 
almost  all  versions  of  opportunistic  routing  such  as  SDF  [2], 
GeRaF [3] and EXOR [4].^1

The opportunistic algorithms proposed in [1]-[6] implicitly
depend on a precise probabilistic model of wireless connections
and local topology of the network. In practical settings,
however, these probabilistic models have to be “learned” and
“maintained”. With the exception of [7], which provides a
sensitivity analysis of opportunistic routing when channel
models are erroneous, the question of learning and estimating
channel statistics has, by and large, not been explored
in the opportunistic routing context. In this paper, using a
reinforcement learning framework, we propose an adaptive
opportunistic routing (AdaptOR) algorithm which minimizes
the expected average per packet cost when zero or erroneous
knowledge of transmission success probabilities and network
topology is available.

This work was partially supported by the UC Discovery Grant #com07-10241,
Intel Corp., QUALCOMM Inc., Texas Instruments Inc., and CWC at
UCSD, and NSF CAREER Award CNS-0533035.

^1 The variations in [2]-[4] are due to the authors' choices of cost measures
to optimize. For instance, an optimal route in the context of EXOR is computed
so as to minimize the expected number of transmissions (ETX), while GeRaF
uses the smallest expected geographical distance from the destination as a
criterion for selecting the next hop.

The  rest  of  the  paper  is  organized  as  follows:  In  Section  II, 
we discuss the system model and formulate the problem. Section
III-A formally introduces our proposed routing algorithm,
Adaptive Opportunistic Routing (AdaptOR). We then state the
optimality theorem for the AdaptOR algorithm in Section III-B. In
Section  IV,  we  analyze  the  convergence  and  optimality  of  the 
algorithm.  Finally,  we  conclude  the  paper  and  discuss  future 
work  in  Section  V. 

We end this section with a note on the notation used. For a
vector $x \in \mathbb{R}^D$, $D \geq 1$, we use $x(l)$ to denote the $l$-th element
of the vector. We use $n^+$ to denote the time just after the start
of slot $[n, n+1)$ and $(n+1)^-$ to denote the time just before
the end of the slot $[n, n+1)$.

II.  System  Model 

We consider the problem of routing packets from the source
node $o$ to a destination node $d$ in a wireless ad-hoc network
of $d+1$ nodes denoted by the set $\Theta = \{o, 1, 2, \ldots, d\}$. The
time is slotted and indexed by $n \geq 0$. A packet indexed by
$m \geq 1$ is generated at the source node $o$ at time $\tau_s^m$ according
to an arbitrary distribution with stabilizable rate $\lambda > 0$.

We  assume  that  the  successful  reception  of  the  packet 
transmitted  by  a  node  occurs  according  to  a  fixed  conditional 
probability  distribution  over  the  set  of  nodes  in  the  network. 
Furthermore,  we  assume  that  successful  transmissions  over 
different  time  slots  are  independent  and  identically  distributed. 
In particular, we characterize the behavior of the wireless
channel using a probabilistic local broadcast model [6]. The
local broadcast model is defined using the transition probability
$P(S|i)$, $S \subseteq \Theta$, $i \in \Theta$, where $P(S|i)$ denotes the
probability of successful reception of a packet transmitted by
node $i$ by all the nodes in $S$. Note that for all $S \neq S'$,
successful reception at $S$ and $S'$ are mutually exclusive and
$\sum_{S \subseteq \Theta} P(S|i) = 1$. Logically, node $i$ is always a recipient of
its own transmission, i.e. $P(S|i) = 0$ if $i \notin S$. The local broadcast
model generalizes the notion of a link and allows for correlation
of successful receptions. When successful transmissions to
various nodes are independent, $P(S|i)$ can be written as
$\prod_{j \in S} p_{ij}$, where $0 \leq p_{ij} \leq 1$ represents the link quality. The
successful  reception  of  the  packet  by  the  neighbors  is  assumed 
to  be  known  at  the  centralized  controller  with  zero  error  and 
propagation  delay. 
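
As an illustration of the local broadcast model, the following minimal Python sketch stores $P(S|i)$ as an explicit distribution over reception sets and samples from it. The 3-node network (node 0 playing the role of the source $o$, node 2 the destination $d$), the probability values, and the names `P`, `check_model` and `sample_reception` are all hypothetical and chosen only for this example.

```python
import random

# Hypothetical local broadcast model: P[i][S] is the probability that exactly
# the set S receives a packet transmitted by node i (values are made up).
P = {
    0: {frozenset({0}): 0.2, frozenset({0, 1}): 0.5,
        frozenset({0, 2}): 0.2, frozenset({0, 1, 2}): 0.1},
    1: {frozenset({1}): 0.1, frozenset({1, 2}): 0.9},
}

def check_model(P, tol=1e-9):
    """Check the two structural properties stated for P(S|i)."""
    for i, dist in P.items():
        # Reception outcomes are mutually exclusive, so they must sum to one.
        if abs(sum(dist.values()) - 1.0) > tol:
            return False
        # A transmitter is always a recipient: P(S|i) = 0 if i is not in S.
        if any(i not in S for S in dist):
            return False
    return True

def sample_reception(P, i, rng):
    """Draw one reception set S with probability P(S|i)."""
    sets = list(P[i])
    weights = [P[i][S] for S in sets]
    return rng.choices(sets, weights=weights, k=1)[0]

rng = random.Random(0)
print(check_model(P), sorted(sample_reception(P, 0, rng)))
```

Representing the model as a distribution over sets, rather than per-link probabilities, keeps correlated receptions expressible, which is the point of the local broadcast abstraction.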

Given  a  successful  transmission  from  node  i  to  the  set 
of  nodes  S,  the  next  (possibly  randomized)  routing  decision 
includes 1) retransmission by node $i$, 2) relaying the packet by a
node $j \in S$, or 3) dropping the packet altogether. If the
controller decides to use node $j$ for relay, then node $j$ is
assumed to transmit the packet at the next slot, while other
nodes $k \neq j$, $k \in S$ drop that packet.

We assume that upon a transmission from node $i$ a fixed
transmission cost $c_i > 0$ is incurred. Transmission cost $c_i$
can  be  considered  to  model  the  amount  of  energy  used  for 
transmission,  the  expected  time  to  transmit  a  given  packet,  or 
the  hop  count  when  the  cost  is  equal  to  unity. 

We  define  the  termination  event  for  packet  m  to  be  the 
event  that  packet  m  is  either  received  by  the  destination  or  is 
dropped by a relay before reaching the destination. We define the
termination time $\tau_T^m$ to be the random time at which packet
$m$ is terminated. We discriminate amongst the termination
events as follows: We assume that upon the termination of a
packet at the destination (successful delivery of a packet to the
destination), a fixed and given positive reward $R$ is obtained,
while if the packet is terminated (dropped) before it reaches the
destination, no reward is obtained. Let $r_m$ denote the random
reward obtained at the termination time $\tau_T^m$, i.e. it is either
zero if the packet is dropped prior to reaching the destination
node or $R$ if the packet is received at the destination.

Given the assumptions and model, the routing scheme can
be viewed as selecting a (possibly random) sequence of nodes
$\{i_{n,m}\}$ for relaying packets $m = 1, 2, \ldots$. As such, the
expected average per packet reward associated with routing
packets along the sequence $\{i_{n,m}\}$ is:

$$\lim_{N \to \infty} E\left[ \frac{1}{M_N} \sum_{m=1}^{M_N} \left\{ r_m - \sum_{n=\tau_s^m}^{\tau_T^m - 1} c_{i_{n,m}} \right\} \right], \qquad (1)$$

where $M_N$ denotes the number of packets terminated up to
time $N$, $i_{n,m}$ denotes the index of the node which transmits
packet $m$ at time $n$, and the expectation is taken over the
events of transmission decisions, successful packet receptions,
and packet generation times.^2

Problem (P): We are interested in maximizing (1) by
choosing the sequence of relay nodes $\{i_{n,m}\}$ in the absence
of  knowledge  about  the  local  broadcast  model. 
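
To make the objective concrete, the sketch below computes the empirical counterpart of (1) for a finite batch of terminated packets, reading the per-packet reward as the terminal reward minus the accumulated transmission costs. The packet log format (a delivery flag plus the list of transmitting nodes) and the function names are assumptions made for this illustration only.

```python
def packet_reward(delivered, transmitters, c, R):
    """Reward of one packet: terminal reward minus the sum of transmission costs."""
    r_m = R if delivered else 0.0
    return r_m - sum(c[i] for i in transmitters)

def average_per_packet_reward(packets, c, R):
    """Empirical average per packet reward over a batch of terminated packets."""
    total = sum(packet_reward(dlv, tx, c, R) for dlv, tx in packets)
    return total / len(packets)

# Hypothetical costs, reward, and packet histories.
c = {0: 1.0, 1: 1.0}
R = 10.0
packets = [
    (True, [0, 1]),      # delivered after transmissions by node 0, then node 1
    (False, [0]),        # dropped after one transmission by node 0
    (True, [0, 0, 1]),   # one retransmission by node 0 before relaying via node 1
]
print(average_per_packet_reward(packets, c, R))
```

With unit costs the transmission term reduces to a hop count, as noted for the cost model above.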

^2 Our main result establishes the existence of an optimal policy which
maximizes the lim in (1). This is a strong notion of optimality and implies
that the proposed algorithm's expected average reward is at least as large as the
best case performance (lim sup) of all policies [8, p. 344].

In proposing a solution to Problem (P), we will need the
following definitions of action space, state space, and reward
function associated with each packet $m$. The set of all actions,
the action space, is given by

$$\mathcal{A} = \Theta \cup \{f\},$$

i.e. the set of relay nodes along with the termination action $f$.
The state space is given by the set

$$\mathcal{S} = \bigcup_{i \in \Theta} \{S : P(S|i) > 0\} \cup \{F\},$$

denoting the sets of potential reception outcomes from every
node $i \in \Theta$ together with a termination state $F$. The termination
state $F$ is the state visited by the system when termination
action $f$ is chosen, i.e. $P(F|f) = 1$. Given a set $S$ of nodes
that have received a packet from one of the nodes in $\Theta$, the
set of allowable actions is denoted by $A(S) = S \cup \{f\}$.
The allowable action in the termination state $F$ is $f$, i.e.
$A(F) = \{f\}$. Without loss of generality, the allowable action
associated with any set $S \in \mathcal{S}_d = \{S : d \in S, S \in \mathcal{S}\}$ is
restricted to $f$, i.e. $A(S) = \{f\}$.

It remains to define the reward function $g : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ to
represent the reward obtained from taking an action at a given
state. In summary, $g(S, a)$ is given as:

$$g(S, a) = \begin{cases} -c_i & a = i \in S \\ R & a = f,\ S \in \mathcal{S}_d \\ 0 & a = f,\ S \notin \mathcal{S}_d \end{cases}$$
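
The allowable action sets and the reward function translate directly into code. The following is a minimal sketch in which states are Python `frozenset`s of node indices, the termination state and action are encoded by the strings `"F"` and `"f"`, and the costs and reward are hypothetical; none of these encodings come from the paper.

```python
F = "F"  # termination state
f = "f"  # termination action

def allowable_actions(S, d):
    """A(S): only f in state F or once d holds the packet, otherwise S plus f."""
    if S == F or d in S:
        return [f]
    return sorted(S) + [f]

def g(S, a, c, R, d):
    """Reward for taking action a in state S."""
    if a != f:
        return -c[a]          # relaying through node a costs c_a
    if S != F and d in S:
        return R              # terminating after the destination received
    return 0.0                # dropping before delivery earns nothing

# Hypothetical unit costs, reward, and destination index.
c = {0: 1.0, 1: 1.0}
R = 10.0
d = 2
print(allowable_actions(frozenset({0, 1}), d), g(frozenset({1, 2}), f, c, R, d))
```

Restricting the action set to `f` whenever `d` is in `S` mirrors the "without loss of generality" convention in the definition of $A(S)$.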

Let $S_{n,m}$ and $a_{n,m}$ be respectively the state of the system
and the routing decision at time $n$ for packet $m$.
Let an admissible routing policy $\phi$ be a sequence of actions
$\{a_{\tau_s^m,m}, a_{\tau_s^m+1,m}, \ldots\}$ for all packets $m$ taking values
on the allowable action space $A(S)$. For $a \in A(S)$,
the event $\{a_{n,m} = a\}$ belongs to the $\sigma$-field $\mathcal{H}_n$ generated
by $\bigcup_m \{S_{\tau_s^m,m}, a_{\tau_s^m,m}, \ldots, S_{n-1,m}, a_{n-1,m}, S_{n,m}\}$
for all $m$ such that $\tau_s^m \leq n$. Furthermore, let $\Phi$ denote the set
of admissible policies for Problem (P).

III.  The  Algorithm  and  Main  Results 
A.  Algorithm  AdaptOR 

In this section, we present an Adaptive Opportunistic Routing
(AdaptOR) algorithm to solve Problem (P). At each time
slot $n$, the algorithm uses a score vector $\Lambda_n$ in $\mathbb{R}^v$, where
$v$ is the cardinality of the domain $\mathcal{S} \times \mathcal{A}$.

Remark. $\Lambda_n(S, a)$, evaluated at state $S \in \mathcal{S}$ and action $a \in
A(S)$, can be considered to be an estimate of the expected
reward obtained by taking action $a$ at state $S$ at time slot $n$.

AdaptOR is parametrized by a scalar constant $0 < \gamma < 1$
and a sequence of positive scalars $\{\alpha_n\}_{n=0}^{\infty}$. During any time
slot $[n, n+1)$, the algorithm uses two counting random variables
$\nu_n(S, a)$, $N_n(S)$, and two random sets $W_n$ and $Y_n$ to update
the iterate $\Lambda_n$. Counting random variables $\nu_n(S, a)$ and
$N_n(S)$ are equal to the number of times state-action pair
$(S, a)$ and state $S$ have been reached up to time $n$, respectively.
Random set $W_n \subseteq \Theta$ denotes the set of transmitting nodes
during time slot $[n-1, n)$, while random set $Y_n$ consists of
the set of potential relays associated with transmissions from
nodes in $W_{n-1}$.

Random counters $\nu_n$, $N_n$, random sets $Y_n$, $W_n$, and $\Lambda_n$ are
initialized as follows:

$$\nu_0(S, a) = 0, \quad N_0(S) = 0,$$

$$Y_0 = \{o\}, \quad W_0 = \{o\},$$

$$\Lambda_0(S, a) = \begin{cases} -R & \text{if } (S, a) = (F, f) \\ 0 & \text{otherwise} \end{cases}$$


To  better  conceptualize  the  working  of  algorithm  AdaptOR, 
we  divide  the  execution  of  the  algorithm  into  three  stages  of 
reception,  adaptive  computation,  and  relay/transmission. 


1) Reception and Acknowledgment Stage:

This stage is assumed to occur at time $n$. $W_n \subseteq \Theta$
denotes the (random) set of nodes each of which has
transmitted one packet at time $n^-$. For any transmitter
node $a \in W_n$, let $S^a$ denote the (random) set of
nodes that have successfully received the packet from
node $a$. In the reception and acknowledgment stage,
the successful reception of the transmitted packet is
acknowledged by all the nodes in the set $S^a$ for all
$a \in W_n$. These nodes form the set of potential relays
for node $a$; collectively they form the random set $Y_{n+1}$,
i.e.

$$Y_{n+1} := \{S^a : \forall a \in W_n\}.$$

Upon reception and acknowledgment, the counting
random variables are incremented as follows:

$$N_n(S) = \begin{cases} N_{n-1}(S) + 1 & \text{if } S \in Y_{n+1} \\ N_{n-1}(S) & \text{if } S \notin Y_{n+1} \end{cases}$$

and

$$\nu_n(S, a) = \begin{cases} \nu_{n-1}(S, a) + 1 & \text{if } (S, a) \in Y_n \times W_n \\ \nu_{n-1}(S, a) & \text{if } (S, a) \notin Y_n \times W_n \end{cases}$$

2) Adaptive Computation Stage:

This stage is assumed to occur at $n^+$. In this stage,
for all $(S, a) \in Y_n \times W_n$, $\Lambda_n$ is updated as follows:

$$\Lambda_n(S, a) = \Lambda_{n-1}(S, a) + \alpha_{\nu_n(S,a)} \Big( -\Lambda_{n-1}(S, a) + g(S, a) + \max_{j \in A(S^a)} \Lambda_{n-1}(S^a, j) \Big). \qquad (2)$$

For the state-action pair $(S, a) \notin Y_n \times W_n$, $\Lambda_n$ remains
unchanged, i.e.

$$\Lambda_n(S, a) = \Lambda_{n-1}(S, a).$$

3) Relay/Transmission Stage:

This stage is assumed to occur at $(n+1)^-$. In
this stage, the next set of relay nodes (actions) is
selected. In particular, for all $S \in Y_{n+1}$, random action
$a^S_{n+1} \in A(S)$ is selected according to the following
(randomized) rule:

• with probability $(1 - \epsilon_n(S))$,

$$a^S_{n+1} \in \arg\max_{j \in A(S)} \Lambda_n(S, j)$$

is selected,^3 and

• with probability $\epsilon_n(S)$, $a^S_{n+1} \in A(S)$ is selected
randomly, where

$$\epsilon_n(S) = \frac{1}{N_n(S) + 1}.$$

At time $(n+1)^-$, the set of transmitters $W_{n+1} = \{a :
\forall S \in Y_{n+1},\ a \in \Theta \text{ and } a = a^S_{n+1}\}$ is updated.

All nodes in $W_{n+1}$ transmit a packet at time $(n+1)^-$.
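
The three stages can be sketched in a few lines of Python for the simplified case of a single packet in flight at a time. This is an illustrative sketch under stated assumptions, not the authors' implementation: the 3-node model `P`, the unit costs, the step size $1/\nu$ (one sequence satisfying the usual stochastic approximation conditions), the exploration probability $1/(N(S)+1)$, and all function names are assumptions made here.

```python
import random
from collections import defaultdict

F, f = "F", "f"  # termination state and termination action

def actions(S, d):
    return [f] if S == F or d in S else sorted(S) + [f]

def g(S, a, c, R, d):
    if a != f:
        return -c[a]
    return R if (S != F and d in S) else 0.0

def adaptor_sketch(P, c, R, o, d, num_packets, seed=0):
    rng = random.Random(seed)
    Lam = defaultdict(float)   # score vector, zero by default
    Lam[(F, f)] = -R           # initial score of the pair (F, f)
    nu = defaultdict(int)      # visit counts of state-action pairs
    N = defaultdict(int)       # visit counts of states
    for _ in range(num_packets):
        S = frozenset({o})     # a new packet starts at the source
        while S != F:
            N[S] += 1          # reception stage: count the visit to S
            A = actions(S, d)
            # Relay stage: explore with probability eps, otherwise be greedy.
            eps = 1.0 / (N[S] + 1)   # assumed exploration schedule
            if rng.random() < eps:
                a = rng.choice(A)
            else:
                a = max(A, key=lambda j: Lam[(S, j)])  # ties: smallest index
            # The channel draws the reception set; the learner never reads P.
            if a == f:
                S_next = F
            else:
                sets = list(P[a])
                S_next = rng.choices(sets, weights=[P[a][T] for T in sets])[0]
            # Adaptive computation stage: score update with step size 1/nu.
            nu[(S, a)] += 1
            alpha = 1.0 / nu[(S, a)]
            best_next = max(Lam[(S_next, j)] for j in actions(S_next, d))
            Lam[(S, a)] += alpha * (g(S, a, c, R, d) + best_next - Lam[(S, a)])
            S = S_next
    return Lam

# Hypothetical model: node 0 is the source, node 2 the destination.
P = {
    0: {frozenset({0}): 0.2, frozenset({0, 1}): 0.5,
        frozenset({0, 2}): 0.2, frozenset({0, 1, 2}): 0.1},
    1: {frozenset({1}): 0.1, frozenset({1, 2}): 0.9},
}
c = {0: 1.0, 1: 1.0}
Lam = adaptor_sketch(P, c, R=10.0, o=0, d=2, num_packets=2000)
S01 = frozenset({0, 1})
print({a: round(Lam[(S01, a)], 2) for a in actions(S01, 2)})
```

Because all scores start at zero while every true score is non-positive, the greedy rule in this sketch tries each action at least once before settling, in addition to the explicit random exploration.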


B.  Optimality  of  AdaptOR 

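We will now state our main result on the optimality of
AdaptOR, $\phi^* \in \Phi$. Theorem 1 below shows that the expected
reward obtained by $\phi^*$ (AdaptOR) maximizes (1).

Theorem 1. For all $\phi \in \Phi$,

$$\lim_{N \to \infty} E^{\phi^*}\left[ \frac{1}{M_N} \sum_{m=1}^{M_N} \left\{ r_m - \sum_{n=\tau_s^m}^{\tau_T^m - 1} c_{i_{n,m}} \right\} \right] \geq \limsup_{N \to \infty} E^{\phi}\left[ \frac{1}{M_N} \sum_{m=1}^{M_N} \left\{ r_m - \sum_{n=\tau_s^m}^{\tau_T^m - 1} c_{i_{n,m}} \right\} \right].$$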

IV.  Proof 


In  this  section,  we  prove  the  optimality  of  AdaptOR  in  two 
steps. In the first step, we show that $\Lambda_n$ converges almost
surely.  In  the  second  step  we  use  this  convergence  result  to 
show  that  AdaptOR  is  optimal  for  Problem  (P). 


A. Convergence of $\Lambda_n$

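Let $U : \mathbb{R}^v \to \mathbb{R}^v$ be an operator on vector $\Lambda$ such that

$$(U\Lambda)(S, a) = g(S, a) + \sum_{S'} P(S'|a) \max_{j \in A(S')} \Lambda(S', j).$$

Let $\Lambda^* \in \mathbb{R}^v$ denote the fixed point of operator $U$,^4 i.e.

$$\Lambda^*(S, a) = g(S, a) + \sum_{S'} P(S'|a) \max_{j \in A(S')} \Lambda^*(S', j), \qquad (3)$$

$$\Lambda^*(F, f) = -R. \qquad (4)$$

The following theorem establishes the convergence of recursion
(2) to the fixed point of $U$, $\Lambda^*$.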
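
When the local broadcast model is known, the fixed point in (3)-(4) can be approximated by repeatedly applying $U$. The sketch below does this for a hypothetical 3-node model (node 0 as source, node 2 as destination, unit costs, $R = 10$); the model, the iteration count, and the names `apply_U` and `fixed_point` are illustrative assumptions. It serves only as a reference against which learned scores could be compared.

```python
F, f = "F", "f"  # termination state and termination action

def actions(S, d):
    return [f] if S == F or d in S else sorted(S) + [f]

def g(S, a, c, R, d):
    if a != f:
        return -c[a]
    return R if (S != F and d in S) else 0.0

def apply_U(Lam, P, c, R, d):
    """One application of the operator U, with the (F, f) entry pinned to -R."""
    new = {}
    for (S, a) in Lam:
        if (S, a) == (F, f):
            new[(S, a)] = -R
            continue
        nxt = {F: 1.0} if a == f else P[a]   # P(F|f) = 1
        new[(S, a)] = g(S, a, c, R, d) + sum(
            p * max(Lam[(S2, j)] for j in actions(S2, d))
            for S2, p in nxt.items())
    return new

def fixed_point(P, c, R, o, d, iters=200):
    """Iterate U from the initial score vector (zero except for (F, f))."""
    states = {S for dist in P.values() for S in dist} | {frozenset({o}), F}
    Lam = {(S, a): 0.0 for S in states for a in actions(S, d)}
    Lam[(F, f)] = -R
    for _ in range(iters):
        Lam = apply_U(Lam, P, c, R, d)
    return Lam

# Hypothetical model with made-up probabilities.
P = {
    0: {frozenset({0}): 0.2, frozenset({0, 1}): 0.5,
        frozenset({0, 2}): 0.2, frozenset({0, 1, 2}): 0.1},
    1: {frozenset({1}): 0.1, frozenset({1, 2}): 0.9},
}
c = {0: 1.0, 1: 1.0}
Lam_star = fixed_point(P, c, R=10.0, o=0, d=2)
S01 = frozenset({0, 1})
print({a: round(Lam_star[(S01, a)], 4) for a in actions(S01, 2)})
```

In this toy model relaying through node 1 from state {0, 1} scores higher than retransmitting from node 0, and dropping scores lowest.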

Theorem 2. Let

(J1) $\Lambda_0(\cdot, \cdot) = 0$ and $\Lambda_0(F, f) = -R$,

(J2) $\sum_{n=0}^{\infty} \alpha_n = \infty$, $\sum_{n=0}^{\infty} \alpha_n^2 < \infty$.

Then the iterate $\Lambda_n$ obtained by the stochastic recursion (2)
converges to $\Lambda^*$ almost surely.

Proof: The proof follows using known results on the
convergence of a certain supermartingale process presented
in Theorems 1, 2 in [10]. The detailed proof is provided in
[9]. ■


^3 In case of ambiguity, the node with the smallest index is chosen.
^4 Existence and uniqueness of $\Lambda^*$ is provided in [9].
B. Proof of optimality

Using the convergence result of $\Lambda_n$, next we show that the
expected  average  per  packet  reward  under  AdaptOR  is  equal 
to  the  optimal  expected  average  per  packet  reward  obtained 
for  a  genie-aided  system  where  the  local  broadcast  model  is 
known  perfectly. 

In proving the optimality of the AdaptOR algorithm for Problem
(P), we take our cue from known results on a closely related
Auxiliary Problem (AP) wherein the controller has perfect
knowledge of the local broadcast model, as presented in [1], [6].

Let $\mathcal{F}_n$ be the product $\sigma$-field $\mathcal{P} \times \mathcal{H}_n$ [11], where $\mathcal{P}$ is the
Borel $\sigma$-field generated by the random probability measures
for the local broadcast model.^5 For Auxiliary Problem (AP),
let an admissible routing policy $\pi$ be a sequence of actions
$\{a_{\tau_s^m,m}, a_{\tau_s^m+1,m}, \ldots\}$ for packet $m$ taking values on the
allowable action space $A(S)$ such that the event $\{a_{n,m} = a\}$
belongs to the $\sigma$-field $\mathcal{F}_n$. Furthermore, let $\Pi$ denote the set
of admissible policies for Auxiliary Problem (AP).

The reward associated with policy $\pi \in \Pi$ for routing a
single packet $m$ from the source to the destination is then
given by

$$J^{\pi}(\{o\}) := E^{\pi}\left[ r_m - \sum_{n=\tau_s^m}^{\tau_T^m - 1} c_{i_{n,m}} \,\Big|\, \mathcal{F}_0 \right], \qquad (5)$$

where $\mathcal{F}_0 = \mathcal{P}$, and the expectation $E^{\pi}$ is taken with respect
to the random events as well as the conditional distributions
over the action space defined by policy $\pi$. Now, in this setting, we
are ready to formulate the following Auxiliary Problem (AP)
as a classical shortest path Markov decision problem (MDP).

Auxiliary Problem (AP): Find an optimal policy $\pi^*$ such
that

$$J^{\pi^*}(\{o\}) = \sup_{\pi \in \Pi} J^{\pi}(\{o\}). \qquad (6)$$


Auxiliary Problem (AP) has been extensively studied in [1],
[6], [12], and the following theorem is established in [6].

Fact 1 (Theorem 2.1 [6]). There exists a function $\pi^* : \mathcal{S} \to \mathcal{A}$
such that the policy $a_{n,m} = \pi^*(S_{n,m})$ is an optimal solution
for the Auxiliary Problem (AP).^6 Furthermore, $\pi^*$ is such that

$$\pi^*(S) \in \arg\max_{j \in A(S)} V^*(j), \qquad (7)$$

where the (value) function $V^* : \mathcal{A} \to \mathbb{R}$ is the unique solution to
the following fixed point equation:

$$V^*(d) = R, \qquad (8)$$

$$V^*(i) = \max\Big( \Big\{ -c_i + \sum_{S'} P(S'|i) \big( \max_{j \in S'} V^*(j) \big) \Big\},\ 0 \Big), \qquad (9)$$

$$V^*(f) = 0. \qquad (10)$$

^5 This $\sigma$-field captures the knowledge of the realization of the local broadcast
model and assumes a well-defined prior on these models.

^6 In other words, there exists a stationary, deterministic, and Markov optimal
policy for Auxiliary Problem (AP).

Lastly, $V^*(j)$ is the maximum expected reward for routing a
packet from node $j$ to destination $d$:

$$V^*(j) = J^{\pi^*}(\{j\}) = \sup_{\pi \in \Pi} J^{\pi}(\{j\}).$$

Lemma 1 below states the relationship between the solution
of Problem (P) and that of the Auxiliary Problem (AP). More
specifically, Lemma 1 shows that $V^*(o)$ is an upper bound for
the solution to Problem (P).

Lemma 1. Consider any admissible policy $\phi \in \Phi$ for
Problem (P). Then for all $N = 1, 2, \ldots$,

$$E^{\phi}\left[ \frac{1}{M_N} \sum_{m=1}^{M_N} \left\{ r_m - \sum_{n=\tau_s^m}^{\tau_T^m - 1} c_{i_{n,m}} \right\} \right] \leq V^*(o).$$

Proof: The proof is given in [9]. Intuitively, the result
holds because the set of admissible policies $\Phi$ in (P) is a
subset of the admissible policies $\Pi$ in (AP). ■

Lemma 2 gives the achievability proof for Problem (P)
by showing that the expected average per packet reward of
AdaptOR is no less than $V^*(o)$.

Lemma 2. For any $\delta > 0$,

$$\liminf_{N \to \infty} E^{\phi^*}\left[ \frac{1}{M_N} \sum_{m=1}^{M_N} \left\{ r_m - \sum_{n=\tau_s^m}^{\tau_T^m - 1} c_{i_{n,m}} \right\} \right] \geq V^*(o) - \delta'.$$

Proof: The proof is given in Appendix A. ■

Lemmas 1 and 2 imply that

$$\lim_{N \to \infty} E^{\phi^*}\left[ \frac{1}{M_N} \sum_{m=1}^{M_N} \left\{ r_m - \sum_{n=\tau_s^m}^{\tau_T^m - 1} c_{i_{n,m}} \right\} \right]$$

exists and is equal to $V^*(o)$. This, together with Lemma 1,
establishes the proof of Theorem 1.
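
The fixed point equations (8)-(10) of Fact 1 and the index rule (7) can be illustrated numerically when the model is known. The following sketch runs a simple value iteration on a hypothetical 3-node model (node 0 as source, node 2 as destination, unit costs, $R = 10$); the model, the iteration count, and the names `value_iteration` and `pi_star` are assumptions made for this example.

```python
f = "f"  # termination action

def value_iteration(P, c, R, d, iters=200):
    """Approximate V* by iterating the fixed point equations of Fact 1."""
    V = {i: 0.0 for i in P}
    V[d] = R       # value at the destination
    V[f] = 0.0     # value of terminating
    for _ in range(iters):
        for i in P:
            if i == d:
                continue
            # Expected value of transmitting from i, then using the best receiver.
            forward = -c[i] + sum(
                p * max(V[j] for j in S) for S, p in P[i].items())
            V[i] = max(forward, 0.0)   # dropping is always available
    return V

def pi_star(S, V, d):
    """Index rule: pick the allowable action with the largest value."""
    A = [f] if d in S else sorted(S) + [f]
    return max(A, key=lambda j: V[j])

# Hypothetical model with made-up probabilities.
P = {
    0: {frozenset({0}): 0.2, frozenset({0, 1}): 0.5,
        frozenset({0, 2}): 0.2, frozenset({0, 1, 2}): 0.1},
    1: {frozenset({1}): 0.1, frozenset({1, 2}): 0.9},
}
c = {0: 1.0, 1: 1.0}
V = value_iteration(P, c, R=10.0, d=2)
print(round(V[0], 4), round(V[1], 4), pi_star(frozenset({0, 1}), V, 2))
```

In this toy model the index of node 1 exceeds that of node 0, so the rule relays through node 1 whenever it has received the packet.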

V.  Conclusions 

In  this  paper,  we  proposed  an  adaptive  opportunistic  routing 
scheme  which  maximizes  the  expected  average  per  packet 
reward from the source to the destination in the absence of
knowledge  regarding  network  topology  and  link  qualities. 

We  would  like  to  point  out  that  AdaptOR  can  be  readily 
extended  to  scenarios  in  which  the  routing  decisions  and 
computations  are  done  in  a  decentralized  and  asynchronous 
manner.  We  refer  interested  readers  to  [9]. 

The broadcast model used in this paper assumes a decoupled
operation at the MAC and network layer. While this
assumption seems reasonable for many popular MAC schemes
based on random access philosophy, it ignores the potentially
rich interplay between scheduling and routing which arises in
many TDM-based schemes such as [13]. The joint design of
MAC  and  routing  remains  an  important  area  of  future  research. 
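Appendix

A. Proof of Lemma 2

Proof: From (3), (4), (8), (9) we obtain the following
equality:

$$\arg\max_{j \in A(S)} V^*(j) = \arg\max_{j \in A(S)} \Lambda^*(S, j). \qquad (11)$$

Let

$$b = \min_{S \in \mathcal{S}} \ \min_{\substack{i, j \in A(S) \\ \Lambda^*(S,i) \neq \Lambda^*(S,j)}} \frac{|\Lambda^*(S, i) - \Lambda^*(S, j)|}{2}. \qquad (12)$$

Theorem 2 implies that, in an almost sure sense, there exists a
packet index $m_1 < \infty$ such that for all $n > \tau_s^{m_1}$,

$$|\Lambda_n(S, a) - \Lambda^*(S, a)| < b, \quad S \in \mathcal{S},\ a \in A(S). \qquad (13)$$

Therefore, from time $\tau_s^{m_1}$ onwards, given any set $S$, the
probability that algorithm AdaptOR chooses an action $a \in A(S)$
such that $\Lambda^*(S, a) \neq \max_{j \in A(S)} \Lambda^*(S, j)$ is upper bounded
by $\epsilon_n(S)$. Furthermore, since each state is visited infinitely
often [9] ($N_n(S) \to \infty$), there exists a packet index $m_2 < \infty$
almost surely such that for all $n > \tau_s^{m_2}$, $\max_S \epsilon_n(S) < \delta$ for
a given $\delta > 0$.

Let $m_0 = \max\{m_1, m_2\}$. For all packets with index
$m < m_0$, the overall expected reward is upper-bounded by
$m_0 R < \infty$ and lower-bounded by $-m_0\, d \max_i c_i > -\infty$,
hence their presence does not impact the expected average
reward. Consequently, we only need to consider the errors
due to random decisions of policy $\phi^*$ (exploration) for packets
$m > m_0$.

Consider the $m$-th packet generated at the source. Let $B_k^m$
be an event for which there exist $k$ instances at which the routing
algorithm routes packet $m$ differently from the possible set of
optimal actions. Mathematically speaking, event $B_k^m$ occurs
iff there exist instances $\tau_s^m \leq n_1^m < n_2^m < \cdots < n_k^m \leq \tau_T^m$ such
that for all $l = 1, 2, \ldots, k$,

$$\Lambda^*(S_{n_l^m}, a_{n_l^m}) \neq \max_{j \in A(S_{n_l^m})} \Lambda^*(S_{n_l^m}, j),$$

where $S_{n_l^m}$ is the set of nodes which have successfully
received packet $m$ at time $n_l^m$. We call such an event a
mis-routing of order $k$. It is straightforward to show that for
$m > m_0$,

$$\mathrm{Prob}(B_k^m) \leq \delta^k.$$

For any packet $m$, $m > m_0$, let us consider the expected
differential reward under policies $\pi^*$ and $\phi^*$:

$$\begin{aligned}
E^{\pi^*}\Big[ r_m - \sum_{n=\tau_s^m}^{\tau_T^m - 1} c_{i_{n,m}} \Big] - E^{\phi^*}\Big[ r_m - \sum_{n=\tau_s^m}^{\tau_T^m - 1} c_{i_{n,m}} \Big]
&= \sum_{k=0}^{\infty} \Big\{ V^*(o) - E^{\phi^*}\Big[ r_m - \sum_{n=\tau_s^m}^{\tau_T^m - 1} c_{i_{n,m}} \,\Big|\, B_k^m \Big] \Big\} \times \mathrm{Prob}(B_k^m) \\
&\leq \sum_{k \geq 0} k R\, \mathrm{Prob}(B_k^m) \qquad (14) \\
&\leq \sum_{k=1}^{\infty} k R\, \delta^k \qquad (15) \\
&= \delta', \qquad (16)
\end{aligned}$$

where $\delta' = \frac{R\delta}{(1-\delta)^2}$. Inequality (14) is obtained by noticing
that the maximum loss in the reward occurs if algorithm AdaptOR
decides to drop packet $m$ (no reward) while there exists a node
$j$ in the set of potential forwarders such that $V^*(j) \approx R$.

Thus the expected average per packet reward under policy
$\phi^*$ is bounded as

$$\liminf_{N \to \infty} E^{\phi^*}\left[ \frac{1}{M_N} \sum_{m=1}^{M_N} \left\{ r_m - \sum_{n=\tau_s^m}^{\tau_T^m - 1} c_{i_{n,m}} \right\} \right] \geq \liminf_{N \to \infty} \frac{1}{M_N} \sum_{m=m_0}^{M_N} \big( V^*(o) - \delta' \big) \geq V^*(o) - \delta'.$$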
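
As a numerical sanity check on the size of the loss term, the sketch below assumes that the bound takes the form $\sum_{k \geq 1} k R \delta^k$ (each of $k$ mis-routings costs at most $R$, and a mis-routing of order $k$ has probability at most $\delta^k$) and compares a truncated sum with the closed form $R\delta/(1-\delta)^2$ of that series. The function names and the numerical values are hypothetical.

```python
def delta_prime(R, delta):
    """Closed form of sum_{k>=1} k * R * delta**k, valid for 0 < delta < 1."""
    return R * delta / (1.0 - delta) ** 2

def partial_sum(R, delta, K):
    """Truncated version of the same series, summing k = 1..K."""
    return sum(k * R * delta ** k for k in range(1, K + 1))

R = 10.0
for delta in (0.1, 0.01, 0.001):
    print(delta, partial_sum(R, delta, 200), delta_prime(R, delta))
```

The loss term shrinks roughly linearly in $\delta$ for small $\delta$, which is why it can be made arbitrarily small by waiting until the exploration probability has decayed.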


